1 Introduction and BBC iPlayer streaming data

2 Assignment

3 Learning Objectives

3.1 How to get the most out of this exercise

4 Cleaned Data

I have already processed and cleaned the original view data. In this step you will first generate a user-based database which we will use to train clustering algorithms to identify meaningful clusters in the data.

Let’s load the cleaned data and investigate what’s in the data. See below for column descriptions.

cleaned_BBC_Data <- read_csv(file="Results_Step1.csv",col_names = TRUE)
library(dplyr)
glimpse(cleaned_BBC_Data) 
## Rows: 313,256
## Columns: 17
## $ user_id                   <chr> "cd2006", "cd2006", "cd2006", "cd2006", "cd2…
## $ program_id                <chr> "b8fbf2", "e2f113", "0e0916", "ca03b9", "cfe…
## $ series_id                 <chr> "e0480e", "933a1b", "b68e79", "5d0813", "eba…
## $ genre                     <chr> "Comedy", "Factual", "Entertainment", "Sport…
## $ start_date_time           <dttm> 2017-01-19 22:17:04, 2017-02-14 19:12:36, 2…
## $ streaming_id              <chr> "1484864257965_1", "1487099603980_1", "14847…
## $ prog_duration_min         <dbl> 1.850000, 0.500000, 1.366667, 1.616667, 8.50…
## $ time_viewed_min           <dbl> 1.85000000, 0.49908333, 1.36666667, 1.616666…
## $ duration_more_30s         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ time_viewed_more_5s       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ percentage_program_viewed <dbl> 1.000000000, 0.998166667, 1.000000000, 1.000…
## $ watched_more_60_percent   <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,…
## $ month                     <dbl> 1, 2, 1, 2, 3, 2, 4, 3, 4, 3, 3, 4, 3, 3, 2,…
## $ day                       <dbl> 5, 3, 4, 1, 1, 3, 1, 6, 7, 7, 1, 5, 7, 3, 2,…
## $ hour                      <dbl> 22, 19, 21, 14, 20, 19, 20, 21, 21, 20, 18, …
## $ weekend                   <chr> "weekday", "weekday", "weekday", "weekend", …
## $ time_of_day               <chr> "Evening", "Evening", "Evening", "Day", "Eve…

Before we proceed let’s consider the usage in January only.

cleaned_BBC_Data<-filter(cleaned_BBC_Data,month==1)

5 User based data

We will try to create meaningful customer segments that describe users of the BBC iPlayer service. First we need to change the data to user based and generate a summary of their usage.

5.1 Data format

The data is presented to us in an event-based format (every row captures a viewing event). However we need to detect the differences between the general watching habits of users.

How can you convert the current date set to a customer-based dataset (i.e., summarizes the general watching habits of each user). In what dimensions could BBC iPlayer users be differentiated? Can you come up with variables that capture these? Discuss these issues with your group and determine a strategy on how data must be processed

5.2 Feature Engineering

For the workshop let’s generate the following variables for each user.

  1. Total number of shows watched and ii. Total time spent watching shows on iPlayer by each user in the data
userData<-cleaned_BBC_Data %>%
  group_by(user_id) %>% 
  summarise(noShows=n(), total_Time=sum(time_viewed_min)) 
  1. Proportion of shows watched during the weekend for each user.
#Let's find the number of shows on weekend and weekdays
userData2<-cleaned_BBC_Data %>% group_by(user_id,weekend) %>% summarise(noShows=n())

#Let's find percentage in weekend and weekday
userData3 = userData2%>% group_by(user_id) %>% mutate(weight_pct = noShows / sum(noShows))

#Let's create a data frame with each user in a row.
userData3<-select (userData3,-noShows)
userData3<-userData3%>% spread(weekend,weight_pct,fill=0) %>%as.data.frame()
#Let's merge the final result with the data frame from the previous step.
userdatall<-left_join(userData,userData3,by="user_id")
  1. Proportion of shows watched during different times of day for each user.
#Code in this block follows the same steps above.
userData2<-cleaned_BBC_Data %>%
  group_by(user_id,time_of_day) %>% 
  summarise(noShows=n()) %>%
  mutate(weight_pct = noShows / sum(noShows))

userData4<-select (userData2,-c(noShows))
userData4<-spread(userData4,time_of_day,weight_pct,fill=0)

userdatall<-left_join(userdatall,userData4,by="user_id")

Question 1. Find the proportion of shows watched in each genre by each user. Your code below.

userdatall <- cleaned_BBC_Data %>%
  group_by(user_id,genre) %>%
  summarise(noShows=n()) %>% 
  mutate(weight_pct = noShows / sum(noShows)) %>% 
  select(-c(noShows)) %>% 
  spread(genre, weight_pct, fill=0) %>% 
  left_join(userdatall, ., by="user_id")

Question 2. Add one more variable of your own. Describe why this might be useful for differentating viewers in 1 or 2 lines. Your code below.

userdatall<-cleaned_BBC_Data %>%
  mutate(program_viewed_percentage = time_viewed_min/prog_duration_min) %>% 
  group_by(user_id) %>% 
  summarise(median_viewed_percentage = median(program_viewed_percentage),
            sd_viewed_percentage = sd(program_viewed_percentage)) %>% 
  left_join(userdatall, ., by='user_id')

We decided to add two variables. The first was the median percentage viewed of the programs a user watched. This gives us insight into the median percentage that each user completes of a program when they watch. This could potentially be useful for separating users based on their habits of completing shows. We also calculated the standard deviation of the program viewed percentage to see how much the percentage viewed deviates for a user. Some of the values of the standard deviation are NA’s because customers only have one instance of program viewing. This is not a issue as we filter out these infrequent users later on.

6 Visualizing user-based data

Next visualize the information captured in the user based data. Let’s start with the correlations.

library("GGally")
userdatall %>% 
  select(-user_id) %>% #keep Y variable last
  ggcorr(method = c("pairwise", "pearson"), layout.exp = 3,label_round=2, label = TRUE,label_size = 2,hjust = 1)

Question 3. Which variables are most correlated? What’s the implication of this for clustering?

The variables total watching time and number of shows watched are highly correlated. It may indicate one of those variables is redundant in the clustering analysis. Those users who watch more shows logically tend to watch longer.

Question 4. Investigate the distribution of noShows and total_Time using box-whisker plots and histograms. Explain what you observe in 1-2 sentences. Are you worried about outliers?

From the box plots and histograms we observe that the distributions of noShows and total_Time are right skewed. In addition, there exist outliers in the box plot of noShows and there appear to be outliers in the histogram of total_Time. We are not worried about these outliers as they could represent a real cluster of users which could be extremely active users who use iPlayer frequently.

g1 <- userdatall %>% 
  ggplot(aes(x=log(total_Time)))+
  geom_boxplot()

g2 <- userdatall %>% 
  ggplot(aes(x=total_Time))+
  geom_histogram()
  
g3 <- userdatall %>% 
  ggplot(aes(x=log(noShows)))+
  geom_boxplot()
  
g4 <- userdatall %>% 
  ggplot(aes(x=log(noShows)))+
  geom_histogram()

(g1+g2)/(g3+g4)

6.1 Delete infrequent users

Delete the records for users whose total view time is less than 5 minutes and who views 5 or fewer programs. These users are not very likely to be informative for clustering purposes. Or we can view these users as a ‘low-engagement’ cluster.

userdata_red <- userdatall %>%
  filter(total_Time>=5) %>%
  filter(noShows>=5)
ggplot(userdata_red, aes(x=total_Time))+
  geom_histogram(binwidth=25)+
  labs(x="Total Time Watched (mins)", y= "Count")

7 Clustering with K-Means

Now we are ready to find clusters in the BBC iPlayer viewers. We will start with the K-Means algorithm.

7.1 Training a K-Means Model

Train a K-Means model. Start with 2 clusters and make sure you de-select user_id variable. Also don’t forget to scale the data. Use 50 random starts. Should we use more starts?

Also display the cluster sizes. See the RMarkdown file from the last session to identify the R functions you need for this and the tasks below.

Use summary("kmeans Object") to examine the components of the results of the clustering algorithm. How many points are in each cluster?

set.seed(1234)

k=2
# Get rid of variables that you might not need. Do not include no shows as well because it is highly correlated with total time
clusters_bbc_data <- userdata_red %>% 
  mutate(log_total_time = log(total_Time)) %>%#log transform total time to reduce the impact of outliers 
  select(-c(user_id,noShows,total_Time)) %>% 
  scale() #scale the data
clusters_bbc <- clusters_bbc_data %>% 
  eclust(x=., FUNcluster = "kmeans", k = k, nstart = 50, graph = FALSE) #train kmeans clustering

#summary(clusters_bbc)
#add clusters to the data frame
userdata_red$cluster <- clusters_bbc$cluster

# size of clusters
clusters_bbc$size
## [1] 1133 2492

7.2 Visualizing the results

7.2.1 Cluster centers

Plot the normalized cluster centers. Try to describe the clusters that the algorithm suggests.

cluster_centers <- data.frame(cluster=as.factor(c(1:2)),
                              clusters_bbc$centers)

#transpose this data frame
cluster_centers_t <- cluster_centers %>% 
  gather(variable,value,-cluster,factor_key = TRUE)

#plot the centers
ggplot(cluster_centers_t, aes(x = variable, y = value))+
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+
  geom_point(size=1,shape=4)+geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),)+
  ggtitle("K-means Centers k=2")+
  scale_fill_brewer(palette = "Set4")

Can you interpret each cluster from this plot? Did you arrive at meaningful clusters?

Cluster 1 like to watch on the weekends during the afternoon and Cluster 2 like to watch on weekdays during the evening. For the weekends watchers, the genres they are interested in are likely to be Entertainment, Music and Sports. For the other cluster of watchers, they like to watch Drama, Factual and Weather programmes, and they generally spend more time on TV and are more likely to finish the whole programme.

How can you use the cluster information to improve the viewer experience with BBC iPlayer? We will come back to these points below. However it is important to think about these issues at the beginning.

With the knowledge above, I would recommend them to schedule more of the corresponding programmes preferred by each cluster on weekdays and weekends during the times that the different clusters watch. This would allow users to find more of the shows they like and improve their experience. BBC could also negotiate with ads providers about the lengths and prices for weekday programmes because Cluster 2 clearly have more probability to pay attention to the ads (with greater median_percent_viewed and log(total_time)). On the other hand, reducing the lengths of ads would probably improve the user experience.

7.2.2 Clusters vs variables

Plot a scatter plot for the viewers with respect to total_Time and weekend variables with color set to the cluster number of the user. What do you observe? Which variable seems to play a more prominent role in determining the clusters of users?

The weekend variable plays a more important role in determining the clusters of users. Cluster 2 seems to spend more time watching but it is not as important as the weekend in determining the clusters.

userdata_red %>% 
  ggplot(aes(x = log(total_Time), y = weekend, color = as.factor(cluster))) +
  geom_jitter()+
  labs(color = "Cluster")+
  NULL

7.2.3 Clusters vs PCA components

Repeat the previous step and use the first two principle components using fviz_cluster function.

fviz_cluster(clusters_bbc, clusters_bbc_data, palette = "Set2",
             geom = "point",
             ggtheme = theme_minimal())

7.2.4 Clusters vs PCA components without log transform

As a “side exercise”, use K-means method again but this time do not log transform total time and include no_shows as well. Compare your results to the case when you use log transformation. Then visualize the first two principle components using fviz_cluster function.

userdata_red %>% 
  select(-c(user_id)) %>% 
  scale() %>% #scale the data
  eclust(x=., FUNcluster = "kmeans", k = 2, nstart = 50, graph = F) %>% #train kmeans clustering
  fviz_cluster(userdata_red, palette = "Set2",
             geom = "point",
             ggtheme = theme_minimal())

Do you observe any outliers?

Yes, without taking the log of total time and including no_shows, we can see there are a few outliers that are included in cluster 2.

Since we are looking for general tendencies, we want to remove the ouliers as PCA is very sensative to outliers. We want to make the points in the cluters as similiar to themselves as possible (low within cluter variance), and the inclusion of the outliers makes this hard to do and can lead to misleading results.

7.3 Elbow Chart

Produce an elbow chart and identify a reasonable range for the number of clusters.

fviz_nbclust(clusters_bbc_data,
  kmeans,
  k.max = 20,
  method = "wss")+
  labs(subtitle = "Elbow method")

7.4 Silhouette method

Repeat the previous step for Silhouette analysis.

fviz_nbclust(clusters_bbc_data,
             kmeans, method = "silhouette",
             k.max = 20)+
  labs(subtitle = "Silhouette method")

Question 5: Summarize the conclusions of your Elbow Chart and Silhoutte analysis. What range of values for the number of clusters seems more plausible?

When looking at this elbow chart alone, we get to see what happens when too mant data points are in the middle. As one can see, the sum of the square errors is constantly decreasing as we increase number of K clusters so there is no obvious optimal level of K when we look at this elbow chart. Usually, one will see a line that looks like an elbow, with a rapid decrease SSE when one initially increases K, and then one a threshold is reached, the change in SSE diminishes a lot. Here, we can see that at K=17 the SSE stops decreasing as much, and at k=19, it starts to increase. According to this elbow chart, the SSE is the lowest when K = 19, however we need to take into account the meaningfulness of clustering if we have 19 different clusters.

The silhouette analysis shows us how well matched the data points in each cluster are to each other. We would want the mean Silhoutte Score, which is the average similarity within clusters, to be high, as we want to increase the number of clusters as long as the members within each cluster are similar and the clusters keep on being different from each other. When we go from k=4 to k=5 the SW decreases a significant amount, telling us there there is not as much similarity within the clusters when we have 5 clusters.

Conclusion: 4 can be a good number of clusters

7.5 Comparing k-means with different k

Question 6: For simplicity let’s focus on lower values. Now find the clusters using kmeans for k=3, 4 and 5. Plot the centers and check the number of observations in each cluster. Based on these graphs which one seems to be more plausible? Which clusters are observable in each case? Don’t forget to check the cluster sizes.

K=4 seems the most plausible due to the higher silhouette score compared to k=5. K=4 adds a new distinct cluster of users who watch a lot of weather and news which K=3 fails to capture. Cluster 4 in K=4 seems to have the majority of users with a size of 1983. These are the users who prefer to watch factual and drama content during the weekdays.

#Fit kmeans models
model_km3 <- clusters_bbc_data %>% #scale the data
  eclust(x=., FUNcluster = "kmeans", k = 3, nstart = 50, graph = FALSE)
model_km4 <- clusters_bbc_data %>% #scale the data
  eclust(x=., FUNcluster = "kmeans", k = 4, nstart = 50, graph = FALSE)
model_km5 <- clusters_bbc_data %>% #scale the data
  eclust(x=., FUNcluster = "kmeans", k = 5, nstart = 50, graph = FALSE)
# model_km3$size 2380 1066  179
# model_km4$size 940  178  524 1983
# model_km5$size 1264  494  724  961  182
# PCA visualizations
# plots to compare
# I use the fviz_cluster function which is part of the`factoextra` library
p1 <- fviz_cluster(clusters_bbc, geom = "point", data = clusters_bbc_data) + ggtitle("k = 2")
p2 <- fviz_cluster(model_km3, geom = "point",  data = clusters_bbc_data) + ggtitle("k = 3")
p3 <- fviz_cluster(model_km4, geom = "point",  data = clusters_bbc_data) + ggtitle("k = 4")
p4 <- fviz_cluster(model_km5, geom = "point",  data = clusters_bbc_data) + ggtitle("k = 5")
grid.arrange(p1,p2,p3,p4, nrow = 2)

#Plot centers
#Plot centers for k=3
xa<-data.frame(cluster=as.factor(c(1:3)),model_km3$centers)
xa2k3<-xa %>% gather(variable,value,-cluster,factor_key = TRUE)

graphknn3<-ggplot(xa2k3, aes(x = variable, y = value))+
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+
  geom_point(size=1,shape=4)+geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),legend.title=element_text(size=5),
        legend.text = element_text(size=5))+ggtitle("K-means Centers k=3")+
  scale_fill_brewer(palette = "Set4")

#Plot centers for k=4
xa<-data.frame(cluster=as.factor(c(1:4)),model_km4$centers)

xa4<-xa %>% gather(variable,value,-cluster,factor_key = TRUE)
graphknn4<-ggplot(xa4, aes(x = variable, y = value))+
  geom_line(aes(color = cluster,group = cluster), linetype = "dashed",size=1)+
  geom_point(size=1,shape=4)+geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),legend.title=element_text(size=5),legend.text = element_text(size=5))+
  ggtitle("K-means Centers k=4")+
  scale_fill_brewer(palette = "Set4")

#Plot centers for k=5
xa<-data.frame(cluster=as.factor(c(1:5)),model_km5$centers)

xa2<-xa %>% gather(variable,value,-cluster,factor_key = TRUE)
graphknn5<-ggplot(xa2, aes(x = variable, y = value))+
  geom_line(aes(color = cluster,group = cluster),
            linetype = "dashed",size=1)+
  geom_point(size=1,shape=4)+geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),
        legend.title=element_text(size=5),
        legend.text = element_text(size=5))+
  ggtitle("K-means Centers k=5")+
  scale_fill_brewer(palette = "Set4")

# model_km3$size
# model_km4$size
# model_km5$size
graphknn3

graphknn4 

graphknn5

# Compare Silhouette Graphs
s2<-fviz_silhouette(clusters_bbc)+
  ggtitle(paste("k = 2", "avg sw=",
                format(round(clusters_bbc$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1 1133          0.00
## 2       2 2492          0.17
s3<-fviz_silhouette(model_km3)+
  ggtitle(paste("k = 3", "avg sw=",
                format(round(model_km3$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1 2380          0.17
## 2       2 1066          0.01
## 3       3  179          0.15
s4<-fviz_silhouette(model_km4)+
  ggtitle(paste("k = 4", "avg sw=",
                format(round(model_km4$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1  940         -0.01
## 2       2  178          0.14
## 3       3  524          0.11
## 4       4 1983          0.17
s5<-fviz_silhouette(model_km5)+
  ggtitle(paste("k = 5", "avg sw=",
                format(round(model_km5$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1 1264          0.03
## 2       2  494          0.11
## 3       3  724         -0.04
## 4       4  961          0.20
## 5       5  182          0.13
grid.arrange(s2,s3,s4,s5, nrow = 2)

8 Comparing results of different clustering algorithms

8.1 PAM

Fit a PAM model for the k value you chose above for k-means. Determine how many points each cluster has. Plot the centers of the clusters and produce PCA visualization.

#Here we use pam clustering with k=4
k=4
k4_pam <-eclust(clusters_bbc_data, "pam", k = 4, graph = FALSE)

#First generate a new data frame with cluster medoids and cluster numbers
cluster_medoids <- data.frame(cluster=as.factor(c(1:k)), k4_pam$medoids)

#transpose this data frame so 
cluster_medoids_t <- cluster_medoids %>%
  gather(variable,value,-cluster,factor_key = TRUE)

#plot medoids
graphkmeans_4Pam <- ggplot(cluster_medoids_t, aes(x = variable, y = value))+ 
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+
  geom_point(size=1,shape=4)+geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),)+
  ggtitle("Pam Medoids k=4")+
  scale_fill_brewer(palette = "Set4")

(graphkmeans_4Pam / graphknn4)

(p4+
fviz_cluster(k4_pam, data = clusters_bbc_data, palette="Set4",
             geom= "point",
             ggtheme = theme_minimal()))

PAM_silhouette <- fviz_silhouette(k4_pam)+
  ggtitle(paste("PAM, k = 4", "avg sw=",
                format(round(k4_pam$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1  941          0.15
## 2       2 1393          0.04
## 3       3 1123         -0.06
## 4       4  168          0.16
(s4 + PAM_silhouette)

8.2 H-Clustering

Use Hierercahial clustering with the same k you chose above. Set hc_method equal to average and then ward.D. What differences do you observe between the results of these two methods? Visualize the results using dendrograms. How many points does each cluster have? Plot the centers of the clusters and produce PCA visualization.

When ward.D is used as the agglomeration method it produces a dendrogram with a much greater height compared to when average is used. When average is used the cluster sizes become unreliable with two clusters having a size of 1.

#dist function find the distances between points
res.dist <- dist(clusters_bbc_data, method = "euclidean")
res.hc <- hcut(res.dist, hc_method = "ward.D",k=4)
res.hc_avg <- hcut(res.dist, hc_method = "average",k=4)

#Dendrogram with hc_method = "word.D"
plot(res.hc,hang = -1, cex = 0.5, labels = FALSE)

#Dendrogram with hc_method = "average"
plot(res.hc_avg,hang = -1, cex = 0.5, labels = FALSE)

# res.hc$size 1518  485 1453  169
# res.hc_avg$size 3605   18    1    1

Plot the centers of H-clusters and compare the results with K-Means and PAM.

Let’s look at cluster centers.

#First let's find the averages of the variables by cluster
clusters_bbc_data_hc<-mutate(data.frame(clusters_bbc_data), 
                                   cluster = as.factor(res.hc$cluster))

center_locations <- clusters_bbc_data_hc %>%
  group_by(cluster) %>%
  summarize_at(vars(weekday:log_total_time),mean)

#Next I use gather to collect information together
xa2<- gather(center_locations, key = "variable", value = "value",-cluster,factor_key = TRUE)

#Next I use ggplot to visualize centers
hclust_center<-ggplot(xa2, aes(x = variable, y = value,order=cluster))+
  geom_line(aes(color = cluster,group = cluster), linetype = "dashed",size=1)+
  geom_point(size=2,shape=4)+
  geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),)+
  ggtitle("H-clust K=3")+
  labs(fill = "Cluster")+
  scale_fill_brewer(palette = "Set4")
## Compare it with KMeans
(hclust_center/graphknn4/graphkmeans_4Pam)

p4 <- fviz_cluster(model_km4, geom = "point",  data = clusters_bbc_data) + ggtitle("Kmeans, k = 4")
p4_hc <- fviz_cluster(res.hc, geom = "point",  data = clusters_bbc_data) + ggtitle("H-Clustering, k = 4")
(p4+p4_hc)

s4_hc <- fviz_silhouette(res.hc) + ggtitle(paste("H-Clustering, k = 4", "avg sw=",
                format(round(res.hc$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1 1518          0.17
## 2       2  485          0.10
## 3       3 1453         -0.09
## 4       4  169          0.15
s4<-fviz_silhouette(model_km4)+
  ggtitle(paste("K-means, k = 4", "avg sw=",
                format(round(model_km4$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1  940         -0.01
## 2       2  178          0.14
## 3       3  524          0.11
## 4       4 1983          0.17
(s4+s4_hc)

Question 7: Based on the results of these three methods, what can you conclude?

After comparing the results of the three methods we can conclude that they all produce very similar clusters.This tells us that there is reason to believe that these clusters do exist within the data. In addition we can conclude that K-means produces the most reliable clusters due to the higher silhouette score compared to PAM and H-Clust.

9 Subsample check

At this stage you must have chosen the number of clusters. We will try to reinforce your conclusions and verify that they are not due to chance by dividing the data into two equal parts. Use K-means clustering, fixing the number of clusters to your choice, in these two data sets separately. If you get similar looking clusters, you can rest assured that you conclusions are robust. If not you might want to reconsider your decision.

library(rsample)
#the following code chunk splits the data into two. Replace ... with your data frame that contains the data
set.seed(1234)
train_test_split <- initial_split(data.frame(clusters_bbc_data), prop = 0.5)
testing <- testing(train_test_split) #50% of the data is set aside for testing
training <- training(train_test_split) #50% of the data is set aside for training

#Fit k-means to each dataset and compare your results
model_km4_train <- training %>% #scale the data
  eclust(x=., FUNcluster = "kmeans", k = 4, nstart = 50, graph = FALSE)

model_km4_test <-testing %>% #scale the data
  eclust(x=., FUNcluster = "kmeans", k = 4, nstart = 50, graph = FALSE)

p_train <- fviz_cluster(model_km4_train, geom = "point",  data = clusters_bbc_data) + ggtitle("k = 4 train")
p_test <- fviz_cluster(model_km4_test, geom = "point",  data = clusters_bbc_data) + ggtitle("k = 4 test")

p_train + p_test

cluster_centers_train <- data.frame(cluster=as.factor(c(1:4)),
                              model_km4_train$centers)

cluster_centers_test <- data.frame(cluster=as.factor(c(1:4)),
                              model_km4_test$centers)

#transpose this data frame
cluster_centers_train_transpose <- cluster_centers_train %>% 
  gather(variable,value,-cluster,factor_key = TRUE)

cluster_centers_test_transpose <- cluster_centers_test %>% 
  gather(variable,value,-cluster,factor_key = TRUE)

#plot the centers
graph_cluster_centers_train <- ggplot(cluster_centers_train_transpose, aes(x = variable, y = value))+
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+
  geom_point(size=1,shape=4)+geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),)+
  ggtitle("K-means Centers k=4 train")

graph_cluster_centers_test<- ggplot(cluster_centers_test_transpose, aes(x = variable, y = value))+
  geom_line(aes(color =cluster,group = cluster), linetype = "dashed",size=1)+
  geom_point(size=1,shape=4)+geom_hline(yintercept=0)+
  theme(text = element_text(size=10),
        axis.text.x = element_text(angle=45, hjust=1),)+
  ggtitle("K-means Centers k=4 test")


graph_cluster_centers_train/graph_cluster_centers_test

train_silhouette<-fviz_silhouette(model_km4_train)+
  ggtitle(paste("k = 4 train", "avg sw=",
                format(round(model_km4_train$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1  437          0.00
## 2       2  464         -0.01
## 3       3  815          0.20
## 4       4   96          0.11
test_silhouette<-fviz_silhouette(model_km4_test)+
  ggtitle(paste("k = 4 test", "avg sw=",
                format(round(model_km4_test$silinfo$avg.width,3))))
##   cluster size ave.sil.width
## 1       1  916          0.21
## 2       2  373         -0.05
## 3       3  448          0.00
## 4       4   76          0.22
train_silhouette+test_silhouette

Question 8: Based on the results, what can you conclude? Are you more or less confident in your results?

We got similar results of clusters for the training and testing data. The clusters numbers change from training to testing but the same clusters are still represented. The sizes of the clusters are also very similar between the training and the testing sets. The silhouette scores are also very similar and surprisingly improve in the testing set. As the distinct clusters exist in both data sets, the silhouette scores are similar and the sizes of the clusters are similar, we can be more confident in our results.

We can show the clusters in each set with their features (the variables they have the highest value with).

Training set:

  • cluster 1: Weekend Entertainment, Sports (cluster 2 in testing data)
  • cluster 2: Day, Weekday, News, Factual and Weather (cluster 3 in testing data) (Maybe old people)
  • cluster 3: Evening, Drama (cluster 1 in testing data)
  • cluster 4: Afternoon, Children and Learning

The PCA and Silhouette visualizations are also very similar. So we are more confident with the clustering results.

10 Conclusions

Question 9: In plain English, explain which clusters you can confidently conclude that exist in the data, based on all your analysis in this exercise.

We can conclude that there exists 4 main clusters in the data. The first is the cluster of users who watch sports and a bit of entertainment on the weekend. The second cluster are children who watch Children and Learning programmes during the afternoon throughout the entire week. The third cluster are users who watch the news, factual and weather programmes during the day and usually on weekdays. The last cluster are users who watch drama in the evening and night.

Do you think you chose the right k? Explain you reasoning.

Based on Silhouette method, the score becomes worse when increase k from 4 to 5. In addition, we can get insightful conclusions based on k=4. Although with k=19 the silhouette score is the lowest, the explainablity will be worse with too many clusters. We believe k=4 is a right number of clusters as it provides a good balance between explainability and a high silhouette score showing that the clusters have low within variance and high between cluster variance.

What assumptions do you think your results are sensitive to? How can you check the robustness of your conclusions? Just explain, you don’t have to carry out the analysis.

With this analysis, we are assuming that users viewing habits will be the same in the future as they are in the dataset. We can check the robustness by repeating the clustering model over time and then testing the performance of models. If the performance stays the same we can conclude that the habits are similar over time.

Finally explain how the information about these clusters can be used to improve viewer experience for BBC or other online video streaming services.

In order to improve the customer experience BBC could schedule more of the programmes that the different clusters enjoy during the times that they usually watch. Examples of this would be more sport channels on the weekend or more drama shows during the evening and night. BBC could also produce more types of shows related to the different clusters to give the viewers more options of shows they like. By scheduling and producing more shows that each cluster likes, BBC will be giving the viewers more of what they enjoying watching. This will lead to an improved viewing experience and hopefully less churn.

BBC could also do more marketing campaigns targeting the different groups of customers based on the clustering results.